Security & Privacy
Django: Detecting Trojans in Object Detection Models via Gaussian Focus Calibration
Object detection models are vulnerable to backdoor or trojan attacks, where an attacker can inject malicious triggers into the model, leading to altered behavior during inference. As a defense mechanism, trigger inversion leverages optimization to reverse-engineer triggers and identify compromised models. While existing trigger inversion methods assume that each instance from the support set is equally affected by the injected trigger, we observe that the poison effect can vary significantly across bounding boxes in object detection models due to its dense prediction nature, leading to an undesired optimization objective misalignment issue for existing trigger reverse-engineering methods. To address this challenge, we propose the first object detection backdoor detection framework Django (Detecting Trojans in Object Detection Models via Gaussian Focus Calibration). It leverages a dynamic Gaussian weighting scheme that prioritizes more vulnerable victim boxes and assigns appropriate coefficients to calibrate the optimization objective during trigger inversion. In addition, we combine Django with a novel label proposal pre-processing technique to enhance its efficiency. We evaluate Django on 3 object detection image datasets, 3 model architectures, and 2 types of attacks, with a total of 168 models. Our experimental results show that Django outperforms 6 state-of-the-art baselines, with up to 38% accuracy improvement and 10x reduced overhead.
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
As large language models (LLMs) become increasingly prevalent across many realworld applications, understanding and enhancing their robustness to adversarial attacks is of paramount importance. Existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations.
United We Stand, Divided We Fall: Fingerprinting Deep Neural Networks via Adversarial Trajectories
In recent years, deep neural networks (DNNs) have witnessed extensive applications, and protecting their intellectual property (IP) is thus crucial. As a noninvasive way for model IP protection, model fingerprinting has become popular. However, existing single-point based fingerprinting methods are highly sensitive to the changes in the decision boundary, and may suffer from the misjudgment of the resemblance of sparse fingerprinting, yielding high false positives of innocent models. In this paper, we propose ADV-TRA, a more robust fingerprinting scheme that utilizes adversarial trajectories to verify the ownership of DNN models. Benefited from the intrinsic progressively adversarial level, the trajectory is capable of tolerating greater degree of alteration in decision boundaries. We further design novel schemes to generate a surface trajectory that involves a series of fixed-length trajectories with dynamically adjusted step sizes. Such a design enables a more unique and reliable fingerprinting with relatively low querying costs. Experiments on three datasets against four types of removal attacks show that ADV-TRA exhibits superior performance in distinguishing between infringing and innocent models, outperforming the state-of-the-art comparisons.
Order-Invariant Cardinality Estimators Are Differentially Private
We consider privacy in the context of streaming algorithms for cardinality estimation. We show that a large class of algorithms all satisfy ษ-differential privacy, so long as (a) the algorithm is combined with a simple down-sampling procedure, and (b) the input stream cardinality is ฮฉ(k/ษ). Here, k is a certain parameter of the sketch that is always at most the sketch size in bits, but is typically much smaller. We also show that, even with no modification, algorithms in our class satisfy (ษ, ฮด)-differential privacy, where ฮด falls exponentially with the stream cardinality. Our analysis applies to essentially all popular cardinality estimation algorithms, and substantially generalizes and tightens privacy bounds from earlier works. Our approach is faster and exhibits a better utility-space tradeoff than prior art.
Stress-Testing Capability Elicitation With Password-Locked Models
To determine the safety of large language models (LLMs), AI developers must be able to assess their dangerous capabilities. But simple prompting strategies often fail to elicit an LLM's full capabilities. One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task. In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elicit capabilities. To do this, we introduce password-locked models, LLMs fine-tuned such that some of their capabilities are deliberately hidden.
Is AI porn the next horizon in self-pleasure -- and is it ethical?
The AI revolution is well and truly upon us. As we grapple with the ramifications of generative AI in our professional and personal worlds, it's worth remembering that its impact will be felt in even the most intimate corners of our lives -- including our private browsers. Whether you're aware of it or not, AI is coming for the porn industry. Already, there are a number of new genres emerging which make use of generative AI, such as hyper porn, a genre of erotic imagery which stretches the limits of sexuality and human anatomy to hyperbolic new heights (think: a Barbie-esque woman with three giant breasts, instead of two). There are also various iterations of'gone wild' porn, a subdivision of porn which sees users attempt to'trick' safe-for-work image generation models like Dall-E into depicting erotic scenes -- and enjoying the work-arounds and euphemisms which these tools may use to avoid depicting explicit sex.
Improving with A Dynamic Discriminator Supplementary Material
This supplementary material is organized as follows. We first discuss the broader impact of the proposed DynamicD in Sec. A. More implementation details are provided in Sec. B to ensure the reproduction. Addtionally, we present the analysis of various sub-nets in Sec. D presents the training dynamics for the further analysis.